MidTerm Telecom Customer Churn Analysis
Emma Horton – Data Visualization Midterm Project
Repository: https://github.com/HumaDec/DataVizMidterm
For this project, I selected the Telco Customer Churn dataset (originally provided by IBM and available on Kaggle), which includes data on 7,043 telecom customers across 21 columns. Each row represents a customer, with information spanning:
Demographics (e.g., gender, senior citizen status)
Subscription details (e.g., contract type, internet service)
Billing metrics (e.g., monthly and total charges)
Churn status
I chose this dataset for its rich mix of categorical variables (e.g., Contract, Churn) and quantitative variables (e.g., tenure, MonthlyCharges), which are well suited to visualization and exploratory analysis.
Libraries Used:
library(ggplot2)
library(dplyr)
library(plotly)
library(lsr)
library(broom)
Data Loading
To ensure that blank fields are parsed as missing values (especially in TotalCharges), I used na.strings = c(""), which converts empty strings into NA values.
telco <- read.csv("Telco-Customer-Churn.csv", na.strings = c(""))
#str(telco)
#summary(telco)
#head(telco, 5)
Data Cleaning
Convert SeniorCitizen to a labeled factor
Convert multiple character variables to factors
Convert TotalCharges to numeric (after removing empty string values)
Standardize “No internet service”/“No phone service” to “No”
Drop customerID, which is not useful
# SeniorCitizen to factor
telco$SeniorCitizen <- factor(telco$SeniorCitizen, levels=c(0,1), labels=c("No", "Yes"))
# Character columns to factors
cat_cols <- c("gender","Partner","Dependents","PhoneService",
"MultipleLines","InternetService","OnlineSecurity","OnlineBackup",
"DeviceProtection","TechSupport","StreamingTV","StreamingMovies",
"Contract","PaperlessBilling","PaymentMethod","Churn")
telco[cat_cols] <- lapply(telco[cat_cols], factor)
# Fix TotalCharges: convert to numeric
telco$TotalCharges <- as.numeric(as.character(telco$TotalCharges))
# Replace "No internet service" with "No" in the internet add-on columns, and "No phone service" with "No" in MultipleLines
no_internet_cols <- c("OnlineSecurity","OnlineBackup","DeviceProtection",
"TechSupport","StreamingTV","StreamingMovies")
for(col in no_internet_cols) {
telco[[col]] <- factor(ifelse(telco[[col]]=="No internet service", "No",
as.character(telco[[col]])))
}
telco$MultipleLines <- factor(ifelse(telco$MultipleLines=="No phone service", "No",
as.character(telco$MultipleLines)))
# Drop customerID
telco$customerID <- NULL
# Verify results
#sum(is.na(telco$TotalCharges))
#levels(telco$MultipleLines)
#levels(telco$OnlineSecurity)
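As an aside, the recoding step above can also be done by renaming factor levels directly. The following is a minimal sketch on a made-up vector (the values in `x` are hypothetical, not taken from the dataset); in R, assigning a level name that already exists merges the two levels.

```r
# Hypothetical toy column standing in for a column such as OnlineSecurity
x <- factor(c("Yes", "No", "No internet service", "Yes"))

# Renaming a level to a name that already exists merges the two levels
levels(x)[levels(x) == "No internet service"] <- "No"

levels(x)  # levels remaining after the merge
table(x)   # counts per level
```

This avoids the round trip through as.character() and ifelse(), though the loop used in the report is equally valid.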
Set Color Palette
To maintain consistency and improve the visual appeal of the plots, I defined a custom color palette for key variable levels.
project_colors <- c(
"Yes" = "#E74C3C",
"No" = "#3498DB",
"Female" = "#9B59B6",
"Male" = "#1ABC9C",
"Month-to-month" = "#F39C12",
"One year" = "#2ECC71",
"Two year" = "#34495E",
"DSL" = "#16A085",
"Fiber optic" = "#D35400"
# "No" is already defined above; names in a palette must be unique
)
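A named palette like this one is looked up by name, so each name should appear only once. The short sketch below, using a hypothetical three-entry palette `pal`, shows how to detect repeated names and which color a repeated name actually resolves to.

```r
# Hypothetical palette with a deliberately repeated name
pal <- c("Yes" = "#E74C3C", "No" = "#3498DB", "No" = "#95A5A6")

# Names that occur more than once
dup_names <- names(pal)[duplicated(names(pal))]
dup_names

# Character indexing returns the first match, so the later entry is never used
unname(pal["No"])
```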
Exploratory Analysis
First, I examined how contract type relates to customer churn using a cross-tabulation, a chi-square test, and a mosaic plot. Both the plot and the test point to a strong, statistically significant association: month-to-month customers churn at a much higher rate (roughly 43%) than those on one-year (11%) or two-year (3%) contracts.
# Cross-tab Churn by Contract
table(telco$Churn, telco$Contract)
##
##       Month-to-month One year Two year
##   No            2220     1307     1647
##   Yes           1655      166       48
# Chi-square
chisq_test <- chisq.test(table(telco$Churn, telco$Contract))
chisq_test
##
## Pearson's Chi-squared test
##
## data: table(telco$Churn, telco$Contract)
## X-squared = 1184.6, df = 2, p-value < 2.2e-16
# mosaic
mosaicplot(table(telco$Contract, telco$Churn),
color = c(project_colors["No"], project_colors["Yes"]),
main = "Churn vs Contract Type",
xlab = "Contract Type", ylab = "Churn")
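With a sample this large, a tiny p-value says little about how strong the association is. The sketch below re-enters the counts from the cross-tab above by hand, then computes the churn rate within each contract type and Cramér's V as an effect size. The object names (`tab`, `churn_pct`, `cramers_v`) are illustrative; the `lsr` package loaded earlier also offers a `cramersV()` helper that should give the same value.

```r
# Counts copied from the cross-tab above (rows: Churn, columns: Contract)
tab <- matrix(c(2220, 1307, 1647,
                1655,  166,   48),
              nrow = 2, byrow = TRUE,
              dimnames = list(Churn = c("No", "Yes"),
                              Contract = c("Month-to-month", "One year", "Two year")))

# Churn rate within each contract type (column-wise proportions, in percent)
churn_pct <- prop.table(tab, margin = 2)["Yes", ] * 100
round(churn_pct, 1)

# Cramer's V = sqrt(chi-square / (n * (min(rows, cols) - 1)))
chi2 <- unname(chisq.test(tab)$statistic)
n <- sum(tab)
k <- min(dim(tab)) - 1
cramers_v <- sqrt(chi2 / (n * k))
round(cramers_v, 2)
```

A value of V near 0.4 is usually read as a moderate-to-strong association for a table of this shape, which matches the visual impression from the mosaic plot.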
To better understand customer retention, I explored the distribution of tenure (i.e., how many months a customer has been with the company). The histogram below, faceted by contract type, shows that many customers are either very new (with a spike in the 0–5 month range) or have stayed for the maximum tenure of 72 months. This suggests the company has both a large group of recent sign-ups and a core of long-term, loyal customers, with fewer in the mid-range. Newer customers are at higher risk of churn, while long-tenured customers have demonstrated lasting retention.
ggplot(telco, aes(x = tenure, fill = Churn)) +
geom_histogram(binwidth = 5, color = "black", position = "stack") +
facet_wrap(~ Contract) +
scale_fill_manual(values = project_colors) +
labs(
title = "Distribution of Customer Tenure by Contract Type and Churn Status",
x = "Tenure (months)",
y = "Number of Customers",
fill = "Churn"
) +
theme_minimal()
To explore how churn varies across payment methods within each contract type, I created a grouped bar chart faceted by contract type. The plot shows that customers who pay by electronic check account for the most churn, particularly among those on month-to-month contracts. In contrast, automatic payments (by bank transfer or credit card) are associated with lower churn. This suggests that payment method may reflect customer commitment or other financial characteristics, and that targeted retention efforts, such as incentives for electronic check users, could help reduce churn.
# Shorten the payment labels for the axis; gsub() returns a character vector, not a factor
telco$PaymentMethod <- gsub("\\s*\\(automatic\\)", "", telco$PaymentMethod)
ggplot(telco, aes(x = PaymentMethod, fill = Churn)) +
geom_bar(position = "dodge") +
facet_wrap(~ Contract) +
scale_fill_manual(values = project_colors) +
labs(
title = "Churn Distribution by Payment Method and Contract Type",
x = "Payment Method",
y = "Customer Count",
fill = "Churn"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
To explore how monthly charges differ across contract types, I used a boxplot to compare the distribution of charges for customers on Month-to-month, One-year, and Two-year plans. The results show that Month-to-month customers tend to pay slightly more, with a wider range of charges and more high-cost outliers. In contrast, customers on longer-term contracts have tighter, lower distributions, likely due to discounts or bundling associated with those plans—or potentially due to differences in the types of customers who opt for them.
ggplot(telco, aes(x = Contract, y = MonthlyCharges, fill = Contract)) +
geom_boxplot() +
scale_fill_manual(values = project_colors) +
labs(
title = "Monthly Charges by Contract Type",
x = "Contract Type",
y = "Monthly Charges (USD)"
) +
theme_minimal()
Rather than speculating further, I ran a multiple linear regression to see which factors drive monthly charges. The results indicate that the services a customer subscribes to explain almost all of the variation (adjusted R-squared of 0.9988). Relative to the DSL baseline, fiber optic internet adds about $25 per month and having no internet service lowers the bill by about $25. Phone service adds about $20, each streaming service (StreamingTV, StreamingMovies) about $10, and each of the other add-ons (MultipleLines, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport) about $5. Once these services are accounted for, contract length is not a significant predictor (p = 0.84 for one-year and p = 0.57 for two-year contracts), and neither are demographics, tenure, payment method, or churn status. The higher charges among month-to-month customers in the boxplot therefore appear to reflect the services those customers choose rather than a contract-based discount. These insights may still inform targeted pricing strategies or personalized plan recommendations. To visualize the regression output, I provide a bar chart of coefficient estimates.
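# Exclude TotalCharges, drop incomplete rows, and convert character columns
# (e.g., PaymentMethod after the gsub() step) back to factors
reg_data <- telco %>%
  select(-TotalCharges) %>%
  na.omit() %>%
  mutate(across(where(is.character), as.factor))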
model <- lm(MonthlyCharges ~ ., data = reg_data)
summary(model)
##
## Call:
## lm(formula = MonthlyCharges ~ ., data = reg_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.2285 -0.6140 -0.0057  0.6070  4.8419
##
## Coefficients:
##                                 Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                    2.498e+01  6.001e-02  416.193   <2e-16 ***
## genderMale                     2.336e-02  2.448e-02    0.954    0.340
## SeniorCitizenYes               1.511e-02  3.565e-02    0.424    0.672
## PartnerYes                    -3.945e-02  2.958e-02   -1.334    0.182
## DependentsYes                  1.287e-02  3.140e-02    0.410    0.682
## tenure                         5.209e-06  8.443e-04    0.006    0.995
## PhoneServiceYes                2.005e+01  4.816e-02  416.349   <2e-16 ***
## MultipleLinesYes               5.018e+00  2.955e-02  169.829   <2e-16 ***
## InternetServiceFiber optic     2.496e+01  3.518e-02  709.555   <2e-16 ***
## InternetServiceNo             -2.505e+01  4.893e-02 -511.848   <2e-16 ***
## OnlineSecurityYes              5.014e+00  3.224e-02  155.490   <2e-16 ***
## OnlineBackupYes                4.992e+00  3.025e-02  165.060   <2e-16 ***
## DeviceProtectionYes            5.022e+00  3.133e-02  160.289   <2e-16 ***
## TechSupportYes                 5.030e+00  3.287e-02  153.037   <2e-16 ***
## StreamingTVYes                 9.974e+00  3.207e-02  311.040   <2e-16 ***
## StreamingMoviesYes             9.967e+00  3.209e-02  310.579   <2e-16 ***
## ContractOne year               7.762e-03  3.844e-02    0.202    0.840
## ContractTwo year              -2.599e-02  4.629e-02   -0.562    0.574
## PaperlessBillingYes           -2.039e-02  2.739e-02   -0.744    0.457
## PaymentMethodCredit card       9.687e-04  3.711e-02    0.026    0.979
## PaymentMethodElectronic check -1.768e-02  3.644e-02   -0.485    0.628
## PaymentMethodMailed check     -1.403e-02  3.949e-02   -0.355    0.722
## ChurnYes                      -2.184e-02  3.262e-02   -0.670    0.503
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.026 on 7020 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9988
## F-statistic: 2.75e+05 on 22 and 7020 DF, p-value: < 2.2e-16
coeffs <- broom::tidy(model) %>%
filter(term != "(Intercept)") %>%
arrange(desc(estimate))
ggplot(coeffs, aes(x = reorder(term, estimate), y = estimate)) +
geom_col(fill = project_colors["Yes"]) +
coord_flip() +
labs(
title = "Regression Coefficients: Impact on Monthly Charges",
x = "Predictor",
y = "Estimated Impact ($)"
) +
theme_minimal()
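Because the model is additive and fits almost perfectly, the coefficient table in the summary output can be read much like a price list. As a worked example, the sketch below adds up the estimates for a hypothetical customer with phone service, fiber optic internet, and streaming TV. It assumes every other predictor sits at its reference level and ignores the tenure term, whose estimate is negligible; the profile and the object names are illustrative.

```r
# Estimates copied from the summary(model) output above (USD per month)
coefs <- c(
  intercept                  = 24.98,   # DSL internet, no phone, no add-ons
  PhoneServiceYes            = 20.05,
  InternetServiceFiberOptic  = 24.96,   # fiber optic relative to DSL
  StreamingTVYes             = 9.974
)

# Hypothetical profile: phone + fiber optic + streaming TV, nothing else
predicted <- sum(coefs)
round(predicted, 2)  # about 79.96
```

Swapping in other estimates from the table gives the expected bill for other service bundles under the same assumptions.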
To explore how churn relates to both tenure and monthly charges, I created a scatter plot where each point represents a customer, colored by churn status.
The visualization reveals that churned customers are heavily concentrated at low tenure levels, supporting earlier findings that many cancellations occur early in the customer lifecycle, often among those on month-to-month contracts. In contrast, churn is rare among long-tenured customers, indicating stronger loyalty or satisfaction over time.
There is also a noticeable cluster of churned customers in the upper-left region of the plot—those with high monthly charges and low tenure—suggesting that new customers with higher bills are more likely to cancel early.
This pattern reinforces the inverse relationship between tenure and churn, and suggests that monthly charges may also contribute to early cancellations. These insights highlight potential opportunities for improving early retention through pricing adjustments, onboarding enhancements, or targeted support for high-risk customer segments.
telco$Churn <- factor(trimws(telco$Churn), levels = c("No", "Yes"))
ggplot(telco, aes(x = tenure, y = MonthlyCharges, color = Churn)) +
geom_point(alpha = 0.6) +
labs(title = "Monthly Charges vs Tenure, by Churn Status",
x = "Tenure (months)", y = "Monthly Charges (USD)") +
scale_color_manual(values = project_colors[c("No", "Yes")]) +
theme_minimal()
To explore the interaction between contract type and internet service on churn behavior, I created a heatmap showing the churn rate across all combinations of these two categorical variables. Each tile represents a unique pairing, with color intensity indicating churn percentage—lighter shades represent lower churn, and deeper reds indicate higher churn.
Month-to-month + Fiber optic customers show the highest churn rate, represented by the most intense red tile. This combination likely reflects customers who face high costs without the commitment of a contract, making them more likely to cancel.
In contrast, Two-year contracts, regardless of internet service, consistently exhibit low churn (often below 10%), shown by the lightest tiles. This reinforces earlier findings about long-term contracts being associated with stronger retention.
DSL users on month-to-month plans have a moderate churn rate—higher than two-year DSL contracts, but not as high as fiber. This could reflect cost differences between DSL and fiber internet.
Interestingly, customers with no internet service on month-to-month plans (likely phone-only users) show relatively low churn, potentially due to lower billing amounts or fewer alternatives.
Overall, the heatmap highlights contract type as the dominant factor in churn: the sharpest contrast runs horizontally across the plot, between the month-to-month column and the longer-term contract columns. Among month-to-month customers, internet service type further differentiates churn risk, with fiber optic customers being the most vulnerable segment.
# Churn rate (%) for each Contract x InternetService combination;
# .groups = "drop" returns an ungrouped result and avoids the regrouping message
churn_rate <- telco %>%
  group_by(Contract, InternetService) %>%
  summarize(churn_pct = mean(Churn == "Yes") * 100, .groups = "drop")
ggplot(churn_rate, aes(x = Contract, y = InternetService, fill = churn_pct)) +
geom_tile(color="white") +
scale_fill_gradient(low="lightyellow", high="red", name="Churn Rate (%)") +
labs(title="Churn Rate Heatmap: Contract vs Internet Service",
x="Contract Type", y="Internet Service") +
theme_minimal()
To explore how revenue accumulates over time for different customer types, I created a bubble plot showing the relationship between monthly charges, total charges, and contract type, animated across tenure bins. Each bubble represents a customer, with size indicating tenure and color representing contract type.
The visualization shows that month-to-month customers tend to accumulate lower total charges early on, though some persist with high monthly fees. In contrast, two-year contract customers steadily build up higher total charges over time, reflecting greater long-term value despite being fewer in number.
This plot reinforces earlier findings that month-to-month plans are riskier, with higher churn rates and more variability in revenue. Many of these customers churn early, especially those facing high monthly charges, pointing to price sensitivity among new sign-ups. Offering early-stage incentives or onboarding support could improve retention in this group.
At the same time, long-tenured customers (5+ years) contribute significantly to overall revenue. These “loyalists” tend to stay on longer contracts and may benefit from locked-in rates or bundled service discounts. Understanding what keeps them satisfied could inform broader retention strategies.
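The code for the animated bubble plot is not shown above, so the following is only a minimal sketch of how such a plot could be built with plotly. It runs on a small hypothetical data frame (`toy`) rather than the real data, and the tenure bin edges and labels are assumptions, not the ones used in the report.

```r
# Hypothetical toy data; TotalCharges is approximated as MonthlyCharges * tenure
toy <- data.frame(
  tenure         = c(0, 3, 12, 13, 30, 60, 72),
  MonthlyCharges = c(20, 85, 70, 55, 95, 25, 110),
  Contract       = c("Two year", "Month-to-month", "Month-to-month", "One year",
                     "One year", "Two year", "Two year")
)
toy$TotalCharges <- toy$MonthlyCharges * toy$tenure

# Assumed bin edges; include.lowest = TRUE keeps tenure 0 in the first bin
toy$tenure_bin <- cut(toy$tenure,
                      breaks = c(0, 12, 24, 48, 72),
                      labels = c("0-12", "13-24", "25-48", "49-72"),
                      include.lowest = TRUE)

# Build the animated bubble chart only if plotly is installed
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(
    toy,
    x = ~MonthlyCharges, y = ~TotalCharges,
    size = ~tenure, color = ~Contract,
    frame = ~tenure_bin,          # one animation frame per tenure bin
    type = "scatter", mode = "markers"
  )
  # print(p) displays the interactive plot
}
```

With the real data, `toy` would be replaced by `telco` after dropping rows with missing TotalCharges.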